The following notebook explores the use of semantic embeddings to create taxonomies of entities, with the goal of building an ontology of the ARIA dataset. It leverages embeddings of the tags identified in articles extracted from OpenAlex and uses them to cluster entities into groups. The taxonomy is assumed to be hierarchical.
Multiple options are explored to create the taxonomy, including:
- strict nested clustering, where KMeans or Agglomerative Clustering is run level by level on the subsets created at the previous level;
- an imbalanced variant of the above, where the number of clusters at each level depends on the size of the parent cluster;
- fuzzy, level-agnostic clustering over the Cartesian product of several parameter values;
- a single Agglomerative Clustering run whose dendrogram is climbed to produce levels;
- nested KMeans runs in which each level clusters the centroids of the previous one.
The utils for this notebook include all functions needed to create the taxonomy:
- ClusteringRoutine: class that performs any of the clustering methods described above.
- run_clustering_generators: runs all clustering methods and returns the results in a dictionary.
- make_dataframe: creates a dataframe with the results of the clustering methods.
- make_plots: creates a series of plots to visualize the results of the clustering methods.
- make_cooccurrences: creates a co-occurrence matrix of the clustering results.
- make_subplot_embeddings: creates a series of subplots with the embeddings of the entities in the taxonomy.
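For reference, a single entry returned by run_clustering_generators has roughly the following shape, as inferred from how the outputs are consumed later in this notebook; the field values here are purely illustrative:
# Hypothetical shape of one run_clustering_generators output entry
example_output = {
    "labels": {"entity name": [0, 1, 3]},  # one cluster id per hierarchy level
    "model": ["<fitted estimator per level>"],
    "silhouette": [0.06, 0.02, 0.01],  # one score per level
    "centroid_params": None,  # populated only by the centroid-based routine
}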
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import umap.umap_ as umap
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, OPTICS
from hdbscan import HDBSCAN
from itertools import chain, product
from toolz import pipe
from collections import defaultdict
from functools import partial
from dap_aria_mapping import PROJECT_DIR
from dap_aria_mapping.getters.openalex import get_openalex_entities
from dap_aria_mapping.notebooks.exploration_utils import (
get_sample,
filter_entities,
embed,
make_subplot_embeddings,
make_dataframe,
make_plots,
make_cooccurrences,
run_clustering_generators
)
bucket_name = 'aria-mapping'
np.random.seed(42)
The entity tags are obtained from OpenAlex. Filtering is applied to remove entities that are too frequent or too infrequent. The entities are then embedded using the SPECTER model. In addition, two-dimensional representations of the embeddings are obtained using UMAP, for plotting purposes.
openalex_entities = pipe(
get_openalex_entities(),
partial(get_sample, score_threshold=80, num_articles=3_000),
partial(filter_entities, min_freq=60, max_freq=95),
)
# embed the deduplicated set of entities
embeddings = pipe(
openalex_entities.values(),
lambda oa: chain(*oa),
set,
list,
partial(embed, model="sentence-transformers/allenai-specter"),
)
print(embeddings.shape)
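The embed helper is assumed to wrap the sentence-transformers library (the model name above is a SentenceTransformer checkpoint). A minimal equivalent might look like the following; embed_sketch is a hypothetical stand-in, not the utils implementation:
from sentence_transformers import SentenceTransformer

def embed_sketch(entities, model="sentence-transformers/allenai-specter"):
    # Encode entity strings into dense vectors; SPECTER yields 768-d embeddings
    return SentenceTransformer(model).encode(list(entities))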
# UMAP
params = [
["n_neighbors", [5]],
["min_dist", [0.05]],
["n_components", [2]],
]
# build one kwargs dict per combination of the parameter grid
keys, permuts = [x[0] for x in params], list(product(*[x[1] for x in params]))
param_perms = [dict(zip(keys, perm)) for perm in permuts]
for perm in param_perms:
embeddings_2d = umap.UMAP(**perm).fit_transform(embeddings)
fig = plt.figure(figsize=(10, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=1)
fig.suptitle(f"{perm}")
plt.show()
2022-12-21 21:24:58,809 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: sentence-transformers/allenai-specter
2022-12-21 21:24:59,928 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device: cpu
(8133, 768)
The following clustering routine iteratively clusters the entity embeddings using Agglomerative Clustering and KMeans. At each level, clustering is performed on the subsets that were created at the previous level.
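Conceptually, the routine works as in the sketch below: a simplified stand-in for ClusteringRoutine (not the utils implementation), assuming every subset contains at least n_clusters points.
import numpy as np
from sklearn.cluster import KMeans

def nested_kmeans_sketch(X, level_params):
    # Returns an (n_points, n_levels) array of nested cluster labels
    paths = np.zeros((len(X), len(level_params)), dtype=int)
    parents = np.zeros(len(X), dtype=int)
    for lvl, params in enumerate(level_params):
        for parent in np.unique(parents):
            mask = parents == parent
            # re-cluster only the points assigned to this parent cluster
            paths[mask, lvl] = KMeans(**params).fit_predict(X[mask])
        parents = parents * 100 + paths[:, lvl]  # unique id per path (<100 clusters/level)
    return paths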
cluster_configs = [
[
KMeans,
[
{"n_clusters": 3, "n_init": 5}, # parent level
{"n_clusters": 2, "n_init": 5}, # nested level 1
{"n_clusters": 10, "n_init": 5},# nested level 2
],
],
[
AgglomerativeClustering,
[
{"n_clusters": 3}, # parent level
{"n_clusters": 2}, # nested level 1
{"n_clusters": 2} # nested level 2
],
],
]
# run clustering generators
cluster_outputs_s, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
# plot results
fig, axis = plt.subplots(2, 3, figsize=(24, 16), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
_, lvl = divmod(idx, 3)
make_subplot_embeddings(
embeddings=embeddings_2d,
clabels=[int(e) for e in cdict.values()],
axis=axis.flat[idx],
label=f"{cluster[-1]} {str(lvl)}",
s=4,
)
fig.savefig(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "clustering_strict.png")
# print silhouettes
for output in cluster_outputs_s:
print(
"Silhouette score - {} clusters - {}: {}".format(
output["model"][-1].__module__,
output["model"][-1].get_params()["n_clusters"],
output["silhouette"],
)
)
Silhouette score - sklearn.cluster._kmeans clusters - 3: [0.05761689]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.05761689, 0.018488815]
Silhouette score - sklearn.cluster._kmeans clusters - 10: [0.05761689, 0.018488815, 0.017510219]
Silhouette score - sklearn.cluster._agglomerative clusters - 3: [0.03492154]
Silhouette score - sklearn.cluster._agglomerative clusters - 2: [0.03492154, 0.02315808]
Silhouette score - sklearn.cluster._agglomerative clusters - 2: [0.03492154, 0.02315808, -0.00731145]
The following clustering routine iteratively clusters the entity embeddings using KMeans. At each level, clustering is performed on the subsets that were created at the previous level. The number of clusters at each level is allowed to vary, determined by the size of the parent cluster, as sketched below.
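The exact size rule lives in the utils; one plausible (hypothetical) version caps the requested n_clusters by the size of the parent subset:
def n_clusters_for(subset_size, requested, min_points_per_cluster=50):
    # hypothetical rule: e.g. a 120-point parent with requested=10 splits into 2
    return max(1, min(requested, subset_size // min_points_per_cluster))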
cluster_configs = [
[
KMeans,
[
{"n_clusters": 3, "n_init": 5}, # parent level
{"n_clusters": 10, "n_init": 5},# nested level 1, total n_clusters is 10+
{"n_clusters": 10, "n_init": 5},# nested level 2, total n_clusters is 10+
],
],
]
# run clustering generators with imbalanced nested clusters
cluster_outputs_simb, plot_dicts = run_clustering_generators(cluster_configs, embeddings, imbalanced=True)
# plot results
fig, axis = plt.subplots(1, 3, figsize=(24, 8), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
labels = [int(e) for e in cdict.values()]
di = dict(zip(sorted(set(labels)), range(len(set(labels)))))
labels = [di[label] for label in labels]
_, lvl = divmod(idx, 3)
make_subplot_embeddings(
embeddings=embeddings_2d,
clabels=labels,
axis=axis.flat[idx],
label=f"{cluster[-1]} {str(lvl)}",
s=4,
)
fig.savefig(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "clustering_strict_imb.png")
# print silhouettes
for output in cluster_outputs_simb:
print(
"Silhouette score - {} clusters - {}: {}".format(
output["model"][-1].__module__,
output["model"][-1].get_params()["n_clusters"],
output["silhouette"],
)
)
Silhouette score - sklearn.cluster._kmeans clusters - 3: [0.058070976]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.058070976, -0.0006202007]
Silhouette score - sklearn.cluster._kmeans clusters - 2: [0.058070976, -0.0006202007, -0.003069942]
The following approach clusters the entity embeddings using any sklearn-style method that implements fit_predict. No notion of level exists in this approach: more fine-grained clusterings are agnostic about the parent cluster output. Including several lists of parameter values produces outputs for the Cartesian product of all parameter values within a clustering method, as sketched below.
cluster_configs = [
[KMeans, [{"n_clusters": [5, 10, 25, 50], "n_init": 5}]], # level 1, level 2, level 3, level 4
[AgglomerativeClustering, [{"n_clusters": [5, 10, 25, 50]}]], # level 1, level 2, level 3, level 4
[DBSCAN, [{"eps": [0.05, 0.1], "min_samples": [4, 8]}]], # level 1, level 2, level 3, level 4
[HDBSCAN, [{"min_cluster_size": [4, 8], "min_samples": [4, 8]}]], # level 1, level 2, level 3, level 4
]
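For instance, the DBSCAN entry above expands into four parameter combinations. A sketch of the assumed expansion logic, mirroring the UMAP grid construction earlier in the notebook:
from itertools import product

def expand_config(params):
    # Yield one kwargs dict per combination of the list-valued parameters
    grid_keys = [k for k, v in params.items() if isinstance(v, list)]
    fixed = {k: v for k, v in params.items() if not isinstance(v, list)}
    for combo in product(*[params[k] for k in grid_keys]):
        yield {**fixed, **dict(zip(grid_keys, combo))}

list(expand_config({"eps": [0.05, 0.1], "min_samples": [4, 8]}))
# [{'eps': 0.05, 'min_samples': 4}, {'eps': 0.05, 'min_samples': 8},
#  {'eps': 0.1, 'min_samples': 4}, {'eps': 0.1, 'min_samples': 8}]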
# run clustering generators with fuzzy clusters
cluster_outputs_f_, plot_dicts = run_clustering_generators(cluster_configs, embeddings)
# plot results
fig, axis = plt.subplots(4, 4, figsize=(40, 40), dpi=200)
for idx, (cdict, cluster) in enumerate(plot_dicts):
make_subplot_embeddings(
embeddings=embeddings_2d,
clabels=[int(e) for e in cdict.values()],
axis=axis.flat[idx],
label=f"{cluster[-1].__module__}",
cmap="gist_ncar",
)
fig.savefig(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "clustering_fuzzy.png")
# print silhouettes
for cluster in cluster_outputs_f_:
print(
"Silhouette score - {}: {}".format(
cluster["model"][-1], cluster["silhouette"]
)
)
Silhouette score - KMeans(n_clusters=5, n_init=5): [0.018971728]
Silhouette score - KMeans(n_clusters=10, n_init=5): [0.011747708]
Silhouette score - KMeans(n_clusters=25, n_init=5): [0.019473195]
Silhouette score - KMeans(n_clusters=50, n_init=5): [0.026868459]
Silhouette score - AgglomerativeClustering(n_clusters=5): [0.020661362]
Silhouette score - AgglomerativeClustering(n_clusters=10): [-0.015053057]
Silhouette score - AgglomerativeClustering(n_clusters=25): [0.0011482539]
Silhouette score - AgglomerativeClustering(n_clusters=50): [0.007710151]
Silhouette score - DBSCAN(eps=0.05, min_samples=4): [0]
Silhouette score - DBSCAN(eps=0.05, min_samples=8): [0]
Silhouette score - DBSCAN(eps=0.1, min_samples=4): [0]
Silhouette score - DBSCAN(eps=0.1, min_samples=8): [0]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=4): [0.121384725]
Silhouette score - HDBSCAN(min_cluster_size=4, min_samples=8): [0.050819315]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=4): [-0.09623869]
Silhouette score - HDBSCAN(min_cluster_size=8, min_samples=8): [0.02264695]
This approach uses a single run of any sklearn clustering method that exposes a children_ attribute. The children_ attribute is used to recreate the dendrogram that produced the clustering, which in turn is used to create the taxonomy. The climbing algorithm advances one union of subtrees at a time. The number of levels is determined by the dendrogram_levels parameter.
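For intuition, the dendrogram can also be reconstructed from children_ and cut at several depths with scipy. The sketch below is a simplified alternative to the utils' climbing algorithm, not its implementation; note that compute_distances=True is needed to expose merge distances:
import numpy as np
from scipy.cluster.hierarchy import cut_tree
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=100, compute_distances=True).fit(embeddings)
n_samples = len(model.labels_)
# number of original observations under each merge node
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if c < n_samples else counts[c - n_samples] for c in merge)
linkage = np.column_stack([model.children_, model.distances_, counts]).astype(float)
level_labels = cut_tree(linkage, n_clusters=[2, 4, 8, 16, 32, 64])  # shape (n_samples, 6)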
cluster_configs = [[AgglomerativeClustering, [{"n_clusters": 100}]]]
# run clustering generators with dendrograms
cluster_outputs_d, plot_dicts = run_clustering_generators(cluster_configs, embeddings, dendrogram_levels=6)
# plot results
fig, axis = plt.subplots(2, 3, figsize=(24, 16), dpi=200)
for i, ax in zip(range(6), axis.flat):
make_subplot_embeddings(
embeddings=embeddings_2d,
clabels=[int(e[i]) for e in cluster_outputs_d["labels"].values()],
axis=ax,
label=f"denrogram - level {i}",
s=4,
)
fig.savefig(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "clustering_dendrogram.png")
This approach uses any number of nested KMeans clustering runs. After a given level, the centroids of the previous level are used as the new data points for the next level.
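The handoff between levels amounts to re-clustering centroids, as in this minimal sketch (again a stand-in, not the utils implementation):
from sklearn.cluster import KMeans

km_fine = KMeans(n_clusters=400, n_init=5).fit(embeddings)
km_coarse = KMeans(n_clusters=20, n_init=5).fit(km_fine.cluster_centers_)
# each original point inherits the coarse label of its fine-level centroid
coarse_labels = km_coarse.labels_[km_fine.labels_]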
cluster_configs = [
[
KMeans,
[
{"n_clusters": 400, "n_init": 5, "centroids": False},
{"n_clusters": 200, "n_init": 5, "centroids": True},
{"n_clusters": 20, "n_init": 5, "centroids": True},
{"n_clusters": 5, "n_init": 5, "centroids": True},
],
],
]
# run clustering generators with centroids
cluster_outputs_c, plot_dicts = run_clustering_generators(
cluster_configs, embeddings, embeddings_2d=embeddings_2d
)
# [HACK] flip order; this should be fixed in run_clustering_generators (it should run highest level → lowest level)
for output_dict in cluster_outputs_c:
for k,v in output_dict["labels"].items():
output_dict["labels"][k] = v[::-1]
output_dict["silhouette"] = output_dict["silhouette"][::-1]
# plot results
fig, axis = plt.subplots(1, 4, figsize=(32, 8), dpi=200)
for idx, cdict in enumerate(cluster_outputs_c):
if not cdict.get("centroid_params", False):
axis[idx].scatter(
embeddings_2d[:, 0],
embeddings_2d[:, 1],
c=[e for e in cdict["labels"].values()],
s=1,
)
else:
axis[idx].scatter(
cdict["centroid_params"]["n_embeddings_2d"][:, 0],
cdict["centroid_params"]["n_embeddings_2d"][:, 1],
c=cdict["model"][idx].labels_,
s=cdict["centroid_params"]["sizes"],
)
print(f"Silhouette score ({idx}): {cdict['silhouette']}")
fig.savefig(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "clustering_centroids.png")
Silhouette score (0): [0.03577604]
Silhouette score (1): [-0.021420388, 0.03577604]
Silhouette score (2): [0.035893414, -0.021420388, 0.03577604]
Silhouette score (3): [0.0706989, 0.035893414, -0.021420388, 0.03577604]
This section outputs silhouette scores for all relevant outputs above. It also constructs barplots of the cluster sizes for each level of the taxonomy across approaches.
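The barplots produced by make_plots presumably resemble the following sketch; plot_cluster_sizes is hypothetical and assumes the dataframe holds one column of labels per hierarchy level:
import numpy as np
import matplotlib.pyplot as plt

def plot_cluster_sizes(df):
    # one barplot of cluster sizes per level/column of the taxonomy dataframe
    fig, axes = plt.subplots(1, df.shape[1], figsize=(6 * df.shape[1], 4))
    for ax, col in zip(np.atleast_1d(axes), df.columns):
        df[col].value_counts().plot(kind="bar", ax=ax, title=col)
    plt.show()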
# Harmonize cluster outputs for analysis
# [HACK] - fix this. For exports, I create a single dictionary per method for the fuzzy clusters
cluster_outputs_f = []
for group in ["sklearn.cluster._kmeans", "sklearn.cluster._agglomerative"]:
dict_group = {
"labels": defaultdict(list),
"model": [],
"silhouette": [],
"centroid_params": None
}
cluster_output = [x for x in cluster_outputs_f_ if x["model"][0].__module__ == group]
for clust in cluster_output:
for k, v in clust["labels"].items():
dict_group["labels"][k].append(v[0])
dict_group["model"].append("_".join([clust["model"][0].__module__.replace(".", ""), str(clust["model"][0].get_params()["n_clusters"])]))
dict_group["silhouette"].append(clust["silhouette"][0])
cluster_outputs_f.append(dict_group)
strict_kmeans_df = make_dataframe(cluster_outputs_s[2], "_strict")
strict_agglom_df = make_dataframe(cluster_outputs_s[5], "_strict")
strict_kmeans_imb_df = make_dataframe(cluster_outputs_simb[-1], "_strict_imbalanced")
fuzzy_kmeans_df = make_dataframe(cluster_outputs_f[0], "_fuzzy")
fuzzy_agglom_df = make_dataframe(cluster_outputs_f[1], "_fuzzy")
dendrogram_df = make_dataframe(cluster_outputs_d, "")
centroid_kmeans_df = make_dataframe(cluster_outputs_c[-1], "_centroids", cumulative=False)
make_plots(strict_kmeans_df)
make_plots(strict_agglom_df)
make_plots(strict_kmeans_imb_df)
make_plots(fuzzy_kmeans_df)
make_plots(dendrogram_df)
make_plots(centroid_kmeans_df)
results = {
"kmeans_strict": cluster_outputs_s[2]["silhouette"],
"agglom_strict": cluster_outputs_s[5]["silhouette"],
"kmeans_strict_imb": cluster_outputs_simb[-1]["silhouette"],
"kmeans_fuzzy": cluster_outputs_f[0]["silhouette"],
"agglom_fuzzy": cluster_outputs_f[1]["silhouette"],
"agglomerative_dendrogram": cluster_outputs_d["silhouette"],
"kmeans_centroid": cluster_outputs_c[-1]["silhouette"],
}
results = {"_".join([k,str(id)]): e for k,v in results.items() for id, e in enumerate(v)}
silhouette_df = pd.DataFrame(results, index=["silhouette"]).T.sort_values(
"silhouette", ascending=False
)
silhouette_df.to_pickle(PROJECT_DIR / "dap_aria_mapping" / "notebooks" / "tmp" / "silhouette_df.pkl")
display(silhouette_df)
| | silhouette |
|---|---|
| agglomerative_dendrogram_0 | 0.111172 |
| kmeans_centroid_0 | 0.070699 |
| kmeans_strict_imb_0 | 0.058071 |
| kmeans_strict_0 | 0.057617 |
| agglomerative_dendrogram_2 | 0.037812 |
| kmeans_centroid_1 | 0.035893 |
| kmeans_centroid_3 | 0.035776 |
| agglom_strict_0 | 0.034922 |
| agglomerative_dendrogram_1 | 0.034922 |
| kmeans_fuzzy_3 | 0.026868 |
| agglomerative_dendrogram_3 | 0.024250 |
| agglom_strict_1 | 0.023158 |
| agglom_fuzzy_0 | 0.020661 |
| kmeans_fuzzy_2 | 0.019473 |
| kmeans_fuzzy_0 | 0.018972 |
| kmeans_strict_1 | 0.018489 |
| kmeans_strict_2 | 0.017510 |
| kmeans_fuzzy_1 | 0.011748 |
| agglom_fuzzy_3 | 0.007710 |
| agglom_fuzzy_2 | 0.001148 |
| kmeans_strict_imb_1 | -0.000620 |
| kmeans_strict_imb_2 | -0.003070 |
| agglom_strict_2 | -0.007311 |
| agglomerative_dendrogram_4 | -0.010194 |
| agglom_fuzzy_1 | -0.015053 |
| kmeans_centroid_2 | -0.021420 |
| agglomerative_dendrogram_5 | -0.025106 |
Following Juan's approach in the AFS repository, we combine the clustering outputs to produce a matrix of entity co-occurrences, with the objective of applying community detection algorithms to it. A sketch of the co-occurrence computation follows the dataframe construction below.
list_dfs = [
strict_kmeans_df,
strict_agglom_df,
strict_kmeans_imb_df,
fuzzy_kmeans_df,
fuzzy_agglom_df,
dendrogram_df,
centroid_kmeans_df
]
meta_cluster_df = (
pd.concat(list_dfs, axis=1)
.reset_index()
.rename(columns={"index": "tag"})
)
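From meta_cluster_df, the co-occurrence matrix can be assembled by counting, across all clustering columns, how often each pair of entities lands in the same cluster. This is a minimal sketch of what make_cooccurrences presumably computes, not its actual implementation:
import numpy as np

label_cols = meta_cluster_df.drop(columns="tag").to_numpy()
n_tags = len(meta_cluster_df)
cooc = np.zeros((n_tags, n_tags), dtype=np.int16)
for col in label_cols.T:
    # same-cluster indicator for this clustering, accumulated across methods
    cooc += col[:, None] == col[None, :]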